Challenge Objective

Your goal is to forecast revenue, and you’d like to understand whether there are relationships between revenue trends/spikes and webpage traffic (do spikes in revenue relate to traffic being driven to specific pages?).

Libraries

library(tidyverse)
library(timetk)
library(lubridate)
library(DataExplorer)

Data

Read in the following data sets.

Transactions Revenue

transactions_tbl  <- read_rds("../00_data/transactions_weekly.rds")
transactions_tbl
## # A tibble: 91 x 2
##    purchased_at revenue
##    <date>         <dbl>
##  1 2018-06-03    32736.
##  2 2018-06-10    38290.
##  3 2018-06-17    39974.
##  4 2018-06-24    35622.
##  5 2018-07-01    28984.
##  6 2018-07-08    47251.
##  7 2018-07-15    31512.
##  8 2018-07-22    50986.
##  9 2018-07-29    41183.
## 10 2018-08-05    36598.
## # … with 81 more rows

Google Analytics by Page

google_analytics_by_page_tbl <- read_rds("../00_data/google_analytics_by_page_daily.rds")
google_analytics_by_page_tbl
## # A tibble: 4,685 x 6
##    date       pagePath                  pageViews organicSearches sessions  rank
##    <date>     <chr>                         <dbl>           <dbl>    <dbl> <int>
##  1 2019-05-08 /                               246              29      125     1
##  2 2019-05-08 /business/2016/08/07/Cus…        38              27       32    11
##  3 2019-05-08 /business/2017/09/18/hr_…        38              28       35     8
##  4 2019-05-08 /business/2017/10/16/sal…        44              30       36    12
##  5 2019-05-08 /business/2017/12/27/six…        28              14       22     9
##  6 2019-05-08 /business/2018/10/08/pyt…        36              28       31    13
##  7 2019-05-08 /business/2019/03/11/ab-…        20              15       16    10
##  8 2019-05-08 /learn.html                      12               1        6    19
##  9 2019-05-08 /p/ds4b-101-r-business-a…         8               1        5    14
## 10 2019-05-08 /p/jumpstart-with-r              71               5       28     6
## # … with 4,675 more rows

Visualize the Data

Transactions

  • Start with transactions_tbl
  • Use plot_time_series() to visualize purchased_at vs revenue
  • Answer the question: What is the time frequency?
transactions_tbl %>%
    plot_time_series(purchased_at, revenue)

ANSWER: Weekly
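One way to confirm the frequency without a plot is to difference the date column and inspect the gaps. A minimal base-R sketch, shown on a small mock of the transactions dates (illustrative, not the full data set):

```r
# Mock of the first few purchased_at values from transactions_tbl
purchased_at <- as.Date(c("2018-06-03", "2018-06-10", "2018-06-17", "2018-06-24"))

# Days between consecutive observations
gaps <- as.numeric(diff(purchased_at))

unique(gaps)
# 7-day gaps confirm weekly data
```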

Sessions by Page

There are 20 pages x 3 metrics (pageViews, organicSearches, and sessions). To make it easier to visualize, do this:

  1. Start with google_analytics_by_page_tbl
  2. Group by “pagePath” - These are the pages that are being visited
  3. Visualize just the “sessions” - This will cut down the data
  4. Use plot_time_series() with .interactive = FALSE and .facet_ncol = 4 (this helps to visualize all of the facets)
google_analytics_by_page_tbl %>%
    group_by(pagePath) %>%
    plot_time_series(date, sessions, .interactive = FALSE, .facet_ncol = 4)

Aggregation to Common Frequency

We need to aggregate both data sets to a common frequency before we can join them.

Aggregate Transactions to Weekly

  1. Start with transactions_tbl
  2. Use summarise_by_time() with .by = "week", and sum() the revenue.
  3. Save as a new variable called transactions_weekly_tbl
transactions_weekly_tbl <- transactions_tbl %>%
    summarise_by_time(purchased_at, .by = "week", revenue = sum(revenue))
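summarise_by_time() comes from timetk. Conceptually, it floors each date to the start of its period and then summarises within each period; a roughly equivalent dplyr/lubridate sketch on toy daily data (the tibble below is illustrative, and this assumes the default Sunday week start):

```r
library(dplyr)
library(lubridate)

# Toy daily transactions: two full weeks starting Sunday 2018-06-03
transactions_daily <- tibble(
    purchased_at = as.Date("2018-06-03") + 0:13,
    revenue      = rep(100, 14)
)

# Floor each date to its week start, then sum revenue within each week
transactions_weekly_sketch <- transactions_daily %>%
    mutate(purchased_at = floor_date(purchased_at, unit = "week")) %>%
    group_by(purchased_at) %>%
    summarise(revenue = sum(revenue))
```

Each week of 7 daily rows collapses to a single row with the summed revenue.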

Aggregate GA Page Sessions to Weekly

  1. Start with google_analytics_by_page_tbl
  2. Select just the “date”, “pagePath” and “sessions” columns.
  3. Group by “pagePath”
  4. Use summarise_by_time() with .by = "week", and sum() the sessions.
  5. Use pivot_wider() to pivot the names from “pagePath” and values from “sessions” to a wide format. Use names_prefix = "sessions_" to identify the new columns as coming from sessions.
  6. Select “date” and any columns that contains("/p/") (these are product pages)
  7. Store the wrangled data as product_page_sessions_weekly_tbl
product_page_sessions_weekly_tbl <- google_analytics_by_page_tbl %>%
    select(date, pagePath, sessions) %>%
    group_by(pagePath) %>%
    summarise_by_time(date, .by = "week", sessions = sum(sessions)) %>%
    pivot_wider(
        names_from = pagePath,
        values_from = sessions,
        names_prefix = "sessions_"
    ) %>%
    select(date, contains("/p/"))
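To see what the pivot step does in isolation, here is a tidyr sketch on a toy long-format table (the page paths and session counts are illustrative). names_prefix prepends "sessions_" to each new column name, so "/p/page-a" becomes "sessions_/p/page-a":

```r
library(tidyr)

# Toy long-format sessions data: 2 pages x 2 weeks
sessions_long <- tibble::tibble(
    date     = as.Date(c("2019-05-05", "2019-05-05", "2019-05-12", "2019-05-12")),
    pagePath = c("/p/page-a", "/p/page-b", "/p/page-a", "/p/page-b"),
    sessions = c(10, 20, 15, 25)
)

# One row per date, one column per page
sessions_wide <- sessions_long %>%
    pivot_wider(
        names_from   = pagePath,
        values_from  = sessions,
        names_prefix = "sessions_"
    )
```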

Join the Datasets

Left Join

  • Use left_join() to join transactions_weekly_tbl and product_page_sessions_weekly_tbl
  • Store the joined data as transactions_product_page_sessions_weekly_tbl
transactions_product_page_sessions_weekly_tbl <- transactions_weekly_tbl %>%
    left_join(
        product_page_sessions_weekly_tbl,
        by = c("purchased_at" = "date")
    )

Inspect Missing

  • Use plot_missing() to inspect the missing data in transactions_product_page_sessions_weekly_tbl
transactions_product_page_sessions_weekly_tbl %>% plot_missing()
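plot_missing() (from DataExplorer) charts the share of missing values per column. The same numbers can be computed directly; a base-R sketch on a toy data frame:

```r
# Toy frame: column a is 1/3 missing, column b is 2/3 missing
df <- data.frame(
    a = c(1, NA, 3),
    b = c(NA, NA, 6)
)

# Fraction of NA values in each column
colMeans(is.na(df))
```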

Identify Relationships

# Load Checkpoint Data
transactions_product_page_sessions_weekly_tbl <- 
    read_rds("challenge_01_data_checkpoints/transactions_product_page_sessions_weekly_tbl.rds")

Visualize the Joined Data

  1. Start with transactions_product_page_sessions_weekly_tbl
  2. pivot_longer() everything except “purchased_at” to form the data for plotting. This creates 2 columns, “name” and “value”.
  3. Group by “name”
  4. Use plot_time_series() to visualize “purchased_at” vs “value”. Use .facet_ncol = 3
transactions_product_page_sessions_weekly_tbl %>%
    pivot_longer(-purchased_at) %>%
    group_by(name) %>%
    plot_time_series(purchased_at, value, .facet_ncol = 3)

Remove Columns with Too Much Missing

Several of the columns have a lot of missing data. These are pages that did not exist until recently, and unfortunately we aren’t going to be able to use them because they would leave too few rows for the analysis.

  1. Start with transactions_product_page_sessions_weekly_tbl
  2. De-select columns containing “bundle” and anything that ends_with() “with-r/”. These columns have very little data.
  3. Store the subset data as transactions_product_page_subset_tbl
transactions_product_page_subset_tbl <- transactions_product_page_sessions_weekly_tbl %>%
    select(-contains("bundle"), -ends_with("with-r/"))

Transform Data

  1. Remove missing data with drop_na()
  2. Apply column-wise transformations using mutate() and across():
    • First take the Log Plus 1: log1p()
    • Then standardize to mean 0, std-dev 1: standardize_vec()
  3. Store the transformed data as log_standardized_transactions_product_page_tbl
log_standardized_transactions_product_page_tbl <- transactions_product_page_subset_tbl %>%
    drop_na() %>%
    mutate(across(-purchased_at, .fns = log1p)) %>%
    mutate(across(-purchased_at, .fns = standardize_vec))
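standardize_vec() comes from timetk and reports the mean and standard deviation it used, which you would keep for inverting the transform later. A base-R sketch of the same two-step transform on a toy revenue vector, including the inversion:

```r
# Toy revenue values spanning several orders of magnitude
revenue <- c(100, 1000, 10000)

# Step 1 - Log Plus 1: tames right skew and is safe at zero
log_revenue <- log1p(revenue)

# Step 2 - Standardize to mean 0, sd 1 (what standardize_vec() does)
mu  <- mean(log_revenue)
sigma <- sd(log_revenue)
std_revenue <- (log_revenue - mu) / sigma

# The transform inverts cleanly with the stored parameters
original <- expm1(std_revenue * sigma + mu)
```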

Cross Correlations

Visualize cross correlations between revenue and anything that contains("session"):

  1. Start with log_standardized_transactions_product_page_tbl
  2. Use plot_acf_diagnostics() with:
    • .ccf_vars = contains("session")
    • .show_ccf_vars_only = TRUE
    • .facet_ncol = 2
log_standardized_transactions_product_page_tbl %>%
    plot_acf_diagnostics(
        purchased_at,
        revenue,
        .ccf_vars = contains("session"),
        .show_ccf_vars_only = TRUE,
        .facet_ncol = 2
    )
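The cross-correlation panels measure how revenue at time t relates to sessions at earlier times. A base-R sketch on toy series (not the course data) showing the idea: revenue here is constructed to follow sessions with a 2-step delay, so the contemporaneous correlation is weak while the lag-aligned correlation is strong.

```r
set.seed(123)

# Toy series: revenue echoes sessions two steps later, plus noise
sessions <- rnorm(100)
revenue  <- c(rep(0, 2), sessions[1:98]) + rnorm(100, sd = 0.1)

# Contemporaneous correlation is weak...
cor(sessions, revenue)

# ...but shifting revenue two steps back against sessions reveals the relationship
cor(sessions[1:98], revenue[3:100])
```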

Results

Revenue is positively correlated with sessions to the 101 course page for roughly 2 lags (the data is weekly, so about 2 weeks) after the visit. Perhaps students take a week or two to think about the decision. There may be an opportunity to follow up with those who do not enroll.

Learning Labs show a longer relationship with revenue (up to 7 lags). Jumpstart dips right away (as students take the free class), but then rises through lags 7 to 10. Perhaps this is when students finish the course and then make a purchase.